Automatic feature subset selection for decision tree-based ensemble methods in the prediction of bioactivity
Authors
Abstract
In structure–activity relationship (SAR) studies, a learning algorithm usually faces the problem of selecting a compact subset of descriptors related to the property of interest while ignoring the rest. This paper presents a new method of molecular descriptor selection that couples three commonly used decision tree (DT)-based ensemble methods with a backward elimination strategy (BES). The proposed method eliminates descriptor redundancy automatically and searches for a more compact descriptor subset tailored to DT-based ensemble methods. Six real SAR datasets related to different categorical bioactivities of compounds are used to evaluate the proposed method. The results indicate that DT-based ensemble methods coupled with BES, especially the boosting tree model, yield better classification performance for compounds related to ADMET.

In the modern pharmaceutical industry, structure–activity relationship (SAR) modeling, an important area of chemometrics, is urgently needed for predicting ADMET (absorption, distribution, metabolism, excretion and toxicity) properties, both to select lead compounds for optimization at the early stage of drug discovery and to screen drug candidates for clinical trials [1–8]. The aim of SAR is to find information relating molecular structure to biological activity. In this process, molecular structures are usually represented by a variety of molecular descriptors, which are easily calculated by software packages. In most cases, the number of compounds with available biological activity values is small compared with the number of descriptors. This undoubtedly creates several difficulties in constructing a SAR model, including:

(1) In most cases only a small number of descriptors have a substantial influence on the property of interest. A large descriptor set usually includes many that are unrelated to that property. The use of such descriptors is likely to introduce noise into the model, which may reduce its prediction accuracy.
(2) It is difficult to determine which descriptors, or combinations of descriptors, are responsible for the property of interest. The identification of important descriptors is of fundamental and practical interest. Research in biology and medicine may benefit from examining the top-ranking descriptors to confirm recent discoveries in new drug research or to suggest new avenues to explore.
(3) Running a learning algorithm on a very large number of descriptors may take considerable time.
(4) Most modeling methods may not work well or even may be …
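The backward elimination idea described in the abstract can be illustrated with a minimal sketch. The excerpt does not state the elimination criterion, so this version assumes that the descriptor with the lowest model-reported importance is dropped each round and that the best-scoring subset seen is kept. The `Evaluator` interface, the `toy_evaluate` stand-in, and the descriptor names are hypothetical; in practice the evaluator would wrap a cross-validated bagging, random forest, or boosting tree model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical evaluator type: given descriptor names, return
# (validation score, importance per descriptor). A real evaluator would
# train a DT-based ensemble; that part is assumed, not shown.
Evaluator = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def backward_elimination(descriptors: Sequence[str],
                         evaluate: Evaluator,
                         min_size: int = 1) -> Tuple[List[str], float]:
    """Drop the least important descriptor each round; return the best subset seen."""
    current = list(descriptors)
    best_subset, best_score = list(current), float("-inf")
    while len(current) >= min_size:
        score, importance = evaluate(current)
        # ">=" prefers the smaller subset when scores tie.
        if score >= best_score:
            best_subset, best_score = list(current), score
        if len(current) == min_size:
            break
        # Assumed criterion: remove the descriptor with the lowest importance.
        weakest = min(current, key=lambda d: importance[d])
        current.remove(weakest)
    return best_subset, best_score


def toy_evaluate(subset: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    # Toy stand-in for a cross-validated ensemble: two informative
    # descriptors, and each irrelevant one costs a small penalty.
    informative = {"logP": 0.5, "TPSA": 0.3}
    score = sum(informative.get(d, -0.05) for d in subset)
    importance = {d: informative.get(d, 0.0) for d in subset}
    return score, importance


subset, score = backward_elimination(
    ["logP", "TPSA", "noise1", "noise2"], toy_evaluate)
print(subset, round(score, 2))
```

With the toy evaluator, the two noise descriptors are removed first and the search retains `logP` and `TPSA`. Keeping the best subset seen, rather than the last one, guards against eliminating a descriptor that the model still needs.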
Similar resources
Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection
With the rise of technology, the possibility of fraud in areas such as banking has increased. Credit card fraud is a crucial problem in banking, and its danger is ever increasing. This paper proposes an advanced data mining method that considers both feature selection and decision cost to improve the accuracy of credit card fraud detection. After selecting the best and most effec...
Development of an Ensemble Multi-stage Machine for Prediction of Breast Cancer Survivability
Prediction of cancer survivability using machine learning techniques has become a popular approach in recent years. An important issue in this regard is that obtaining some features may require difficult and costly experiments, even though these features have a less significant impact on the final decision and can be omitted from the feature set. Therefore, developing a machine for p...
A novel hybrid method for vocal fold pathology diagnosis based on Russian language
In this paper, first, an initial feature vector for vocal fold pathology diagnosis is proposed. Then, for optimizing the initial feature vector, a genetic algorithm is proposed. Some experiments are carried out for evaluating and comparing the classification accuracies which are obtained by the use of the different classifiers (ensemble of decision tree, discriminant analysis and K-nearest neig...
The Usefulness of Ensemble Regressions and Methods for Selecting Optimal Predictor Variables in Predicting Stock Returns
This paper examines the usefulness of ensemble regressions and methods for selecting optimal predictor variables (including the correlation-based method and Relief) for predicting the stock returns of companies listed on the Tehran Stock Exchange. To evaluate the performance of ensemble regression, the evaluation measures (including mean absolute percentage error, root mean squared error, and the coefficient of determination) of this method's predictions, with linear regression and artificial neural networks...
Improving Accuracy in Intrusion Detection Systems Using Classifier Ensemble and Clustering
With recent advances in technology, the number of network-based services is increasing, and sensitive user information is shared through the Internet. Accordingly, large-scale malicious attacks on computer networks could cause severe disruption to network services, so cybersecurity has become a major concern for networks. An intrusion detection system (IDS) could be cons...
Prediction of Molecular Bioactivity for Drug Design Using a Decision Tree Algorithm
A machine learning-based approach to the prediction of molecular bioactivity in new drugs is proposed. Two important aspects are considered for the task: feature subset selection and cost-sensitive classification. These are to cope with the huge number of features and unbalanced samples in a dataset of drug candidates. We designed a pattern classifier with such capabilities based on information...